Partitioning Parallel Documents Using Binary Segmentation
Abstract
In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, many of the bilingual language resources available on the web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents. The common method is to break the documents into segments using predefined anchor words and then to align these segments. This approach is not error-free, and incorrect alignments may decrease translation quality. We present an alternative approach that extracts parallel sentences by partitioning a bilingual document pair into two sub-pairs. This process is performed recursively until all the sub-pairs are short enough. In experiments on the Chinese-English FBIS data, our method produced translation results comparable to those of a state-of-the-art sentence aligner. Combining the two approaches leads to better translation performance.
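The recursive partitioning described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how a split point is chosen, so the scoring function `toy_score`, the toy `lexicon`, the exhaustive search in `best_split`, and the `max_len` threshold are all assumptions introduced here for illustration only.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tokens = List[str]
Pair = Tuple[Tokens, Tokens]
Scorer = Callable[[Tokens, Tokens], float]


def toy_score(src: Tokens, tgt: Tokens, lexicon: Dict[str, str]) -> float:
    """Hypothetical score: count source tokens whose lexicon entry occurs in tgt."""
    tgt_set = set(tgt)
    return float(sum(1 for word in src if lexicon.get(word) in tgt_set))


def best_split(src: Tokens, tgt: Tokens, score: Scorer) -> Optional[Tuple[int, int]]:
    """Try every cut (i, j) and return the one whose two sub-pairs score highest.

    Returns None when either side has fewer than two tokens and cannot be cut.
    """
    best: Optional[Tuple[float, int, int]] = None
    for i in range(1, len(src)):
        for j in range(1, len(tgt)):
            total = score(src[:i], tgt[:j]) + score(src[i:], tgt[j:])
            if best is None or total > best[0]:
                best = (total, i, j)
    return None if best is None else (best[1], best[2])


def partition(src: Tokens, tgt: Tokens, score: Scorer, max_len: int) -> List[Pair]:
    """Recursively split a document pair until both sides are short enough."""
    if len(src) <= max_len and len(tgt) <= max_len:
        return [(src, tgt)]
    cut = best_split(src, tgt, score)
    if cut is None:  # one side cannot be cut further; keep the pair as it is
        return [(src, tgt)]
    i, j = cut
    left = partition(src[:i], tgt[:j], score, max_len)
    right = partition(src[i:], tgt[j:], score, max_len)
    return left + right


if __name__ == "__main__":
    # Assumed toy data: lowercase "source" tokens map to uppercase "target" tokens.
    lexicon = {"a": "A", "b": "B", "c": "C", "d": "D"}
    pairs = partition(
        ["a", "b", "c", "d"],
        ["A", "B", "C", "D"],
        lambda s, t: toy_score(s, t, lexicon),
        max_len=2,
    )
    for s, t in pairs:
        print(s, t)
```

This sketch only cuts both sides in the same order, so the left part of the source is always paired with the left part of the target. A practical system would need a statistically trained score and a faster search than the exhaustive loop over all cut points used here.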
Publication date: 2006